Exploratory Analysis of table actor¶

Filter unwanted columns¶

According to the wiki page, the following columns can be dropped:

  • standard_text_property
  • count_text_property
  • concat_names
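The column filter above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's original code: the `actor` DataFrame is a made-up miniature of the table, and only the three dropped column names are taken from the list above.

```python
import pandas as pd

# Hypothetical miniature of the actor table; the values are for illustration only.
actor = pd.DataFrame({
    "pk_actor": [8007, 24052],
    "concat_standard_name": ["Blanchard, Jules", "Souron, Charles"],
    "standard_text_property": [None, None],
    "count_text_property": [0, 0],
    "concat_names": ["Blanchard, Jules", "Souron, Charles"],
})

# Column names taken from the list above.
UNWANTED_COLUMNS = ["standard_text_property", "count_text_property", "concat_names"]

# errors="ignore" keeps the call safe if a column is already absent.
actor = actor.drop(columns=UNWANTED_COLUMNS, errors="ignore")
print(list(actor.columns))
```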

Table extract¶

| (index) | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41789 | 8007 | Actr8007 | Blanchard, Jules | 1874.0 | 1 | None | NaN | 1 | None | 1 | None | 104.0 | 11.0 | 2008-07-18 19:32:45.000 | 11.0 | 2013-12-18 15:24:16 |
| 19817 | 24052 | Actr24052 | Souron, Charles | 1870.0 | 3 | None | NaN | 1 | None | 1 | None | 104.0 | 11.0 | 2009-12-04 11:23:03.000 | 11.0 | 2013-12-18 15:24:16 |
| 43955 | 22077 | Actr22077 | Tabouis, Robert | 1889.0 | 1 | None | 1973.0 | 1 | 2 | 1 | None | 104.0 | 11.0 | 2009-05-20 11:40:46.000 | 11.0 | 2013-12-18 15:24:16 |
| 53239 | 58923 | Actr58923 | Pacotte, François | 1703.0 | 3 | 3 | NaN | 1 | None | 1 | None | 104.0 | 101.0 | 2016-08-07 21:20:41.480 | 101.0 | 2016-08-07 21:20:42 |
| 32188 | 33753 | Actr33753 | De Coninck, Peter Damasius | 1620.0 | 3 | None | NaN | 1 | None | 1 | None | 104.0 | 27.0 | 2010-05-02 16:50:06.000 | 3.0 | 2013-12-18 15:24:16 |

Filter only wanted rows¶

Some rows have been identified as rows that should not be imported (see this wiki page).

Number of rows before filtering: 61556
Number of rows after filtering: 59526 (2030 removed)
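The exclusion criteria live on the wiki page and are not reproduced here, so the sketch below assumes they boil down to a set of `pk_actor` values to skip. The `excluded_pk` set and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of the actor table.
actor = pd.DataFrame({
    "pk_actor": [8007, 24052, 22077, 58923],
    "concat_standard_name": [
        "Blanchard, Jules", "Souron, Charles",
        "Tabouis, Robert", "Pacotte, François",
    ],
})

# Assumption: the rows to skip are known by primary key (made-up value).
excluded_pk = {24052}

rows_before = len(actor)
actor = actor[~actor["pk_actor"].isin(excluded_pk)]
rows_after = len(actor)

print(f"Rows before: {rows_before}")
print(f"Rows after: {rows_after} ({rows_before - rows_after} removed)")
```

Counting rows before and after the filter, as the report does, is a cheap sanity check that the filter removed the expected number of rows.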

Filter by Actor type¶

For now, we are only interested in persons.

Persons are the rows whose fk_abob_type_actor column equals 104.

Number of actors whose type is not 104: 3

| (index) | pk_actor | concat_actr | concat_standard_name | begin_year | certainty_begin | notes_begin | end_year | certainty_end | notes_end | gender_iso | notes | fk_abob_type_actor | creator | creation_time | modifier | modification_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10340 | 59031 | Actr59031 | Forster, James | 1830.0 | 3 | 3 | 1930.0 | 3 | 3 | 1 | None | 106.0 | 81.0 | 2016-11-29 11:05:00.060 | 81.0 | 2016-11-29 11:05:00 |
| 28940 | 60660 | Actr60660 | Valjean, Jean | 1769.0 | 1 | None | 1833.0 | 1 | None | 1 | None | 106.0 | 122.0 | 2018-10-23 16:48:50.050 | 122.0 | 2018-10-23 16:48:50 |
| 46002 | 46914 | Actr46914 | Dieu (conception chrétienne) | NaN | 1 | None | NaN | None | None | 0 | None | 106.0 | 3.0 | 2013-07-04 11:43:15.990 | 3.0 | 2013-12-18 15:24:16 |
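The split between persons and other actors can be sketched with a boolean mask. The value 104 comes from the text above; the sample DataFrame and the names `PERSON_TYPE`, `persons` and `non_persons` are assumptions made for this illustration.

```python
import pandas as pd

PERSON_TYPE = 104  # actor type identifying persons, as stated above

# Hypothetical miniature of the actor table.
actor = pd.DataFrame({
    "pk_actor": [8007, 59031, 22077],
    "fk_abob_type_actor": [104.0, 106.0, 104.0],
})

# A missing type compares as False, so such rows would land in non_persons.
is_person = actor["fk_abob_type_actor"] == PERSON_TYPE
persons = actor[is_person]
non_persons = actor[~is_person]

print(f"Actors that are not of type {PERSON_TYPE}: {len(non_persons)}")
```

Keeping `non_persons` around makes it easy to inspect the discarded rows before dropping them.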

Discovery¶

Columns contain:
Total number of rows: 59523
  -             "pk_actor":   0.00% empty - 59523 (100.00%) uniques (eg: 44895; 47015)
  -          "concat_actr":   0.00% empty - 59523 (100.00%) uniques (eg: Actr44895; Actr47015)
  - "concat_standard_name":   0.00% empty - 56550 ( 95.01%) uniques (eg: Sainte-Mar...; Costantino...)
  -           "gender_iso":   0.00% empty -     3 (  0.01%) uniques (eg: 1; 2)
  -        "creation_time":   0.00% empty - 34441 ( 57.86%) uniques (eg: 2012-04-08...; 2013-07-26...)
  -    "modification_time":   0.00% empty - 13973 ( 23.47%) uniques (eg: 2013-12-18...; 2016-10-21...)
  -              "creator":   0.01% empty -    88 (  0.15%) uniques (eg: 43.0; 30.0)
  -             "modifier":   8.92% empty -    85 (  0.14%) uniques (eg: 2.0; 30.0)
  -      "certainty_begin":   9.42% empty -     4 (  0.01%) uniques (eg: 3; 1)
  -        "certainty_end":  14.48% empty -     5 (  0.01%) uniques (eg: 3; None)
  -           "begin_year":  18.56% empty -   847 (  1.42%) uniques (eg: 1870.0; 1506.0)
  -             "end_year":  50.66% empty -   819 (  1.38%) uniques (eg: 1930.0; 1545.0)
  -          "notes_begin":  67.74% empty -     5 (  0.01%) uniques (eg: 3; 2)
  -            "notes_end":  72.41% empty -     6 (  0.01%) uniques (eg: 3; 4)
  -                "notes":  89.85% empty -  6012 ( 10.10%) uniques (eg: <p>Il s'ag...; None)
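A summary like the one above could be produced along these lines. This is a sketch under assumptions: the helper name `summarize_columns` is hypothetical, missing values are counted as one distinct value (which seems consistent with `None` appearing among the examples above), and the sample data is made up.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return the share of empty cells and the distinct-value count per column."""
    total = len(df)
    summary = pd.DataFrame({
        # isna().mean() is the fraction of missing cells in each column.
        "empty_pct": df.isna().mean() * 100,
        # dropna=False counts the missing value as one distinct value.
        "uniques": df.nunique(dropna=False),
    })
    summary["uniques_pct"] = summary["uniques"] / total * 100
    # Least-empty columns first, as in the listing above.
    return summary.sort_values("empty_pct", kind="stable")

# Hypothetical miniature of the actor table.
actor = pd.DataFrame({
    "pk_actor": [8007, 24052, 22077, 58923],
    "gender_iso": [1, 1, 1, 1],
    "end_year": [float("nan"), float("nan"), 1973.0, float("nan")],
})

summary = summarize_columns(actor)
print(summary)
```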

Type parsing¶

Based on the summary above, we parse each column into its most meaningful type.
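The exact target types are not listed here, so the following is only a plausible sketch. It assumes years and user identifiers become nullable integers and timestamps become datetimes; the sample values are copied from the table extract.

```python
import pandas as pd

# Small sample using values from the table extract.
actor = pd.DataFrame({
    "begin_year": [1874.0, 1870.0, float("nan")],
    "creator": [11.0, 11.0, 101.0],
    "creation_time": [
        "2008-07-18 19:32:45.000",
        "2009-12-04 11:23:03.000",
        "2016-08-07 21:20:41.480",
    ],
})

# "Int64" (capital I) is the nullable integer dtype: 1874.0 becomes 1874
# and a missing year stays missing instead of forcing a float column.
actor["begin_year"] = actor["begin_year"].astype("Int64")
actor["creator"] = actor["creator"].astype("Int64")

# Explicit format, assuming every timestamp carries fractional seconds.
actor["creation_time"] = pd.to_datetime(
    actor["creation_time"], format="%Y-%m-%d %H:%M:%S.%f"
)

print(actor.dtypes)
```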

Columns analysis¶

Here we report the interesting findings for individual columns. The analysis is not exhaustive.

For some columns, we also update the values.

gender_iso¶

Some gender values are undefined. According to the ISO standard (ISO/IEC 5218), the value must be 0 (not known), 1 (male), 2 (female) or 9 (not applicable), so we replace undefined values with 0.
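One way to apply this rule is sketched below. How an undefined gender is actually stored is not shown here, so the sketch assumes that anything which does not parse to one of the four allowed codes counts as undefined; the sample values are made up.

```python
import pandas as pd

VALID_GENDER_CODES = {0, 1, 2, 9}  # codes allowed by the ISO standard
UNKNOWN_GENDER = 0

# Hypothetical raw values, including a missing one and an invalid one.
gender_iso = pd.Series(["1", "2", None, "undefined", "0"])

# Non-numeric and missing entries become NaN.
codes = pd.to_numeric(gender_iso, errors="coerce")

# Keep valid codes, replace everything else with 0.
cleaned = codes.where(codes.isin(VALID_GENDER_CODES), UNKNOWN_GENDER).astype(int)
print(cleaned.tolist())
```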

certainty_begin¶

We replace missing values with 0.
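A minimal sketch of this replacement, assuming the certainty columns hold numeric codes stored with a nullable integer dtype. The sample values are made up, and the same call is applied to certainty_end for illustration.

```python
import pandas as pd

CERTAINTY_COLUMNS = ["certainty_begin", "certainty_end"]

# Hypothetical sample with missing certainty values.
actor = pd.DataFrame({
    "certainty_begin": pd.Series([1, 3, None, 3], dtype="Int64"),
    "certainty_end": pd.Series([1, None, None, 3], dtype="Int64"),
})

# Missing entries become 0; existing values are left untouched.
actor[CERTAINTY_COLUMNS] = actor[CERTAINTY_COLUMNS].fillna(0)
print(actor)
```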

begin_year¶

certainty_end¶

We replace missing values with 0.

end_year¶

creation_time¶

creator¶

notes¶

All HTML tags, non-ASCII characters and newlines are removed.
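The cleaning described above can be sketched with regular expressions from the standard library. The helper name `clean_note` and the sample note are hypothetical. The tag pattern is deliberately naive and is not a full HTML parser. Replacing newlines with a single space, so that words do not run together, is a choice made for this sketch.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")            # anything that looks like an HTML tag
NON_ASCII_RE = re.compile(r"[^\x00-\x7F]")  # any character outside ASCII
NEWLINE_RE = re.compile(r"[\r\n]+")         # one or more line breaks

def clean_note(note):
    """Strip HTML tags, non-ASCII characters and newlines from a note."""
    if note is None:
        return None
    text = TAG_RE.sub("", note)
    text = NON_ASCII_RE.sub("", text)
    text = NEWLINE_RE.sub(" ", text)
    return text.strip()

# Made-up note in the style of the notes column.
sample = "<p>Il s'agit d'un peintre\nfrançais.</p>"
print(clean_note(sample))
```

Dropping non-ASCII characters loses accented letters ("français" becomes "franais"), which is worth keeping in mind for the French-language notes.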